(54) Speech recognition.

(57) A method and system are disclosed for reducing
perplexity in a speech recognition system within a
telephonic network based upon determined caller
identity. In a speech recognition system which
processes input frames of speech against stored
templates representing speech, a core library of speech
templates is created and stored representing a basic
vocabulary of speech. Multiple caller-specific libraries
of speech templates are also created and stored, each
library containing speech templates which represent a
specialized vocabulary and pronunciations for a specific
geographic location and a particular individual.
Additionally, the caller-specific libraries of speech
templates are preferably processed to reflect the
reduced bandwidth, transmission channel variations and
other signal variations introduced into the system via a
telephonic network. The identification of a caller is
determined upon connection to the network via standard
caller identification circuitry and upon detection of a
spoken utterance, that utterance is processed against
the core library, if the caller's identity cannot be
determined, or against a particular caller-specific
library, if the caller's identity can be determined,
thereby greatly enhancing the efficiency and accuracy of
speech recognition by the system.
The present invention relates to speech recognition
systems.

Speech recognition is well known in the prior art. The
recognition of isolated words from a given vocabulary
for a known speaker is perhaps the simplest type of
speech recognition and this type of speech recognition
has been known for some time. Words within the
vocabulary to be recognized are typically prestored as
individual templates, each template representing the
sound pattern for a word in the vocabulary. When an
isolated word is spoken, the system merely compares the
word to each individual template which represents the
vocabulary. This technique is commonly referred to as
whole-word template matching. Many successful speech
recognition systems use this technique with dynamic
programming to cope with nonlinear time scale variations
between the spoken word and the prestored template.
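
By way of illustration only, whole-word template matching with dynamic programming may be sketched as follows. The sketch uses dynamic time warping over one-dimensional frame values; the function names, the absolute-difference frame distance and the example templates are assumptions made for the illustration and are not taken from any cited system.

```python
import math


def dtw_distance(utterance, template):
    """Accumulated dynamic time warping cost between two frame sequences."""
    n, m = len(utterance), len(template)
    # cost[i][j] is the best cost of aligning the first i input frames
    # with the first j template frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(utterance[i - 1] - template[j - 1])  # assumed frame distance
            # Allow stretching of either sequence, or a one-to-one step.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def match_word(utterance, templates):
    """Return the vocabulary word whose template warps onto the input most cheaply."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical two-word vocabulary; each template is a short list of frame values.
templates = {"yes": [1.0, 3.0, 2.0], "no": [4.0, 4.0, 1.0]}
spoken = [1.0, 1.0, 3.0, 3.0, 2.0]  # "yes" spoken at half speed
best = match_word(spoken, templates)
```

Because the warping path may repeat template frames, the slowly spoken input aligns with the "yes" template at zero cost even though the two sequences differ in length.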

Of greater difficulty is the recognition of continuous
speech or speech which contains proper names or place
names. Continuous speech, or connected words, has been
recognized in the prior art utilizing multiple path
dynamic programming. One example of such a system is
proposed in "Two Level DP Matching - A Dynamic
Programming Based Pattern Matching Algorithm For
Connected Word Recognition," H. Sakoe, IEEE Transactions
on Acoustics, Speech and Signal Processing, Volume
ASSP-27, No. 6, pages 588-595, December 1979. This paper
suggests a two-pass dynamic programming algorithm to
find a sequence of word templates which best matches the
whole input pattern. Each pass through the system
generates a score which indicates the similarity between
every template matched against every possible portion of
the input pattern. In a second pass the score is then
utilized to find the best sequence of templates
corresponding to the whole input pattern.

US-A-5,040,127 proposes a continuous speech recognition
system which processes continuous speech by comparing
input frames against prestored templates which represent
speech and then creating links between records in a
linked network for each template under consideration as
a potentially recognized individual word. The linked
records include ancestor and descendant link records
which are stored as indexed data sets with each data set
including a symbol representing a template, a sequence
indicator representing the relative time the link record
was stored and a pointer indicating a link record in the
network from which it descends.

The recognition of proper names represents an increase
in so-called "perplexity" for speech recognition systems
and this difficulty has been recently recognized in
US-A-5,212,730. This patent proposes name recognition
utilizing text-derived recognition models for
recognizing the spoken rendition of proper names which
are susceptible to multiple pronunciations. A name
recognition technique set forth within this patent
involves entering the name-text into a text database
which is accessed by designating the name-text and
thereafter constructing a selected number of
text-derived recognition models from the name-text
wherein each text-derived recognition model represents
at least one pronunciation of the name. Thereafter, for
each attempted access to the text database by a spoken
name input the text database is compared with the spoken
name input to determine if a match may be accomplished.

US-A-5,202,952 discloses a large-vocabulary
continuous-speech prefiltering and processing system
which recognizes speech by converting the utterances to
frame data sets wherein each frame data set is smoothed
to generate a smooth frame model over a predetermined
number of frames. Clusters of word models which are
acoustically similar over a succession of frame periods
are designated as a resident vocabulary and a cluster
score is then generated by the system, which includes
the likelihood of the smooth frames evaluated utilizing
a probability model for the cluster against which the
smoothed frame model is being compared.

Each of these systems recognizes that successful speech
recognition requires a reduction in the perplexity of a
continuous-speech utterance. Publications which address
this problem are "Perplexity - A Measure of Difficulty
of Speech Recognition Tasks," Journal of the Acoustical
Society of America, Volume 62, Supplement No. 1, page
S-63, Fall 1977, and the "Continuous Speech Recognition
Statistical Methods" in the Handbook of Statistics
Volume 2: Classification, Pattern Recognition and
Reduction of Dimensionality, pages 549-573,
North-Holland Publishing Company, 1982.
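
The term "perplexity" is not defined in this specification. As a minimal sketch, assuming the usual information-theoretic definition (two raised to the entropy, in bits, of the distribution over candidate words), the effect of restricting the active vocabulary may be illustrated as follows. The vocabulary sizes are hypothetical.

```python
import math


def perplexity(probabilities):
    """Perplexity 2**H of a distribution over the candidate words."""
    entropy_bits = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy_bits


def uniform(n):
    """Distribution in which each of n words is equally likely."""
    return [1.0 / n] * n


# Hypothetical sizes: a large general vocabulary versus a smaller
# vocabulary restricted to the words relevant to one caller.
general = perplexity(uniform(1024))
restricted = perplexity(uniform(32))
```

For equally likely words the perplexity equals the vocabulary size, so narrowing the set of templates under consideration lowers the perplexity directly.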

This invention is directed to providing an im- 
proved speech recognition system with an enhanced 
ability to distinguish between large numbers of like 
sounding words, a problem which is particularly dif- 
ficult with proper names, place names and numbers. 

Accordingly, the invention provides a speech recognition
system in which input frames of speech data are
processed against stored templates representing speech
via a telephonic network, said system characterised by:
means for storing a core library of speech templates;
means for storing a plurality of caller-specific
libraries of speech templates; means for determining an
identification of a caller within said telephone
network; means arranged to process an input speech
utterance against said core library of speech templates
in the event an identification of said caller within
said telephone network cannot be determined; and means
arranged to process an input speech utterance against a
selected one of said plurality of caller-specific
libraries of speech templates in response to a
determination of an identification of said caller within
said telephonic network.

Viewed from another aspect the invention provides a
method for speech recognition in which input frames of
speech data are processed against stored templates
representing speech via a telephonic network, said
method characterised by the steps of: attempting to
determine an identification of a caller within said
telephone network; processing an input speech utterance
against a core library of speech templates in the event
an identification of said caller within said telephone
network is not determined; and processing an input
speech utterance against a selected one of a plurality
of caller-specific libraries of speech templates in
response to a determination of an identification of said
caller within said telephonic network.

More particularly, a method and system are provided for
reducing perplexity in a speech recognition system
within a telephonic network based upon determined caller
identity. In a speech recognition system which processes
input frames of speech against stored templates
representing speech, a core library of speech templates
can be created and stored representing a basic
vocabulary of speech. Multiple caller-specific libraries
of speech templates can also be created and stored, each
library containing speech templates which represent a
specialized vocabulary for a specific geographic
location and a particular individual. Additionally, the
caller-specific libraries of speech templates can
preferably be processed to reflect the reduced
bandwidth, transmission channel variations and other
signal variations introduced into the system via a
telephonic network. The identification of a caller is
determined upon connection to the network via standard
caller identification circuitry and upon detection of a
spoken utterance, that utterance is processed against
the core library, if the caller's identity cannot be
determined, or against a particular caller-specific
library, if the caller's identity can be determined,
thereby greatly enhancing the efficiency and accuracy of
speech recognition by the system.

The invention will best be understood by refer- 
ence to the following detailed description of an illus- 
trative embodiment when read in conjunction with the 
accompanying drawings, wherein: 

Figure 1 is a pictorial representation of a distributed
telephonic network;
Figure 2 is a high-level block diagram of the speech
recognition system within the host location of
Figure 1; and
Figure 3 is a high-level logic flowchart which
illustrates a speech recognition process.

With reference now to the figures and in particular with
reference to Figure 1, there is depicted a distributed
telephonic network. As illustrated, multiple user
locations are coupled to a host location 12 via a public
switched telephone network 10. Public switched telephone
network 10 preferably serves to couple multiple users
via telephonic communication to host location 12
utilizing any one of the well known techniques for
implementing such communication. For example, user
location 14 reflects the utilization of a standard
telephone 18 which is coupled to host location 12 via
communication channel 32, public switched telephone
network 10 and communication channel 30. Speech entered
by a user utilizing telephone 18 may then be recognized
utilizing a speech recognition system implemented
utilizing computer 16. Computer 16 may be implemented
utilizing any suitable computer, such as a so-called
"personal" computer, for example the PS/2 computer sold
by International Business Machines Corporation (PS/2 is
a trademark of IBM Corporation).

Alternately, as depicted within Figure 1, a user 
may also utilize a mobile cellular telephone 20 which 
communicates via radio frequency transmission with 
radio tower 22. Radio tower 22 is typically coupled to
public switched telephone network 10 utilizing a land 
line communication channel 34. Additionally, modern 
transcontinental communication is often implement- 
ed utilizing satellite communications such as illustrat- 
ed with satellite 26 and satellite receiver 24. Satellite 
receiver 24 is then coupled to public switched tele- 
phone network 10 via communication channel 36. 

As illustrated within Figure 1, a modern distrib- 
uted telephonic network provides multiple diverse 
communication channels which permit a user to es- 
tablish communication with host location 12. Each 
such communication channel will clearly vary in
those factors which affect the accuracy of a speech 
recognition system which is implemented utilizing 
computer 16. For example, certain communication 
channels may have a reduced bandwidth. Satellite 
systems may suffer from transmission echo or signal 
cut-out problems. Additionally, unpredictable signal 
quality, unknown microphone characteristics at vari- 
ous telephones and various regional accents also 
contribute to the difficulty of implementing a speech 
recognition system utilizing a distributed telephonic 
network such as the one depicted within Figure 1.

Additionally, selected communication channels 
within public switched telephone networks often util- 
ize known compression algorithms or various other 
signal processing techniques which alter and vary the 
quality and content of a spoken utterance, rendering 
the recognition of that utterance more difficult than 
speech recognition within a local system.

Referring now to Figure 2, there is depicted a
high-level block diagram of the speech recognition
system which may be implemented utilizing computer 16 of
Figure 1. This system illustrates the manner in which
caller identification may be utilized to decrease the
perplexity of speech recognition in such a system. As
illustrated within Figure 2, a memory 46 is provided
within the speech recognition system implemented within
computer 16 which includes a core library 48 of speech
templates which represent a basic vocabulary of speech.
Similarly, multiple caller-specific libraries 50 are
also stored within memory 46. Each caller-specific
library 50 preferably includes templates which are
representative of any specialized vocabulary associated
with a particular geographic location which is
associated with the communication channel typically
utilized by that caller, and the data within those
templates has preferably been altered to reflect the
bandwidth, microphone characteristics, analog signal
quality and various other parameters associated with a
particular caller within the distributed telephonic
network of Figure 1.

Those skilled in the art will appreciate that such
caller-specific libraries may be created by filtering
and processing spoken utterances through a network which
models the communication channel through which the
spoken utterance must be detected. Further, it should be
apparent upon reference to this specification that each
caller-specific library may include a series of speech
templates which are representative of specific
geographic locations, business establishments, or proper
names which are germane to a selected geographic
location associated with the identity of a selected
caller within the distributed telephonic network.
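
The creation of a caller-specific library by passing clean templates through a model of the communication channel may be sketched as follows. This is a minimal sketch under stated assumptions: the moving-average filter is only a crude stand-in for a band-limited channel, and the names `channel_model` and `build_caller_library` are hypothetical rather than part of the described system.

```python
def channel_model(samples, taps=3):
    """Crude low-pass stand-in for a band-limited telephone channel.

    Each output value is the mean of the current and up to (taps - 1)
    preceding input values. A real channel model would be more elaborate.
    """
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - taps + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def build_caller_library(clean_templates, channel):
    """Filter every clean template through the given channel model."""
    return {word: channel(frames) for word, frames in clean_templates.items()}


# Hypothetical clean template recorded without channel degradation.
clean = {"Austin": [0.0, 3.0, 0.0, 3.0]}
library = build_caller_library(clean, lambda s: channel_model(s, taps=2))
```

Templates prepared in this way already carry the smoothing that the channel would impose on a live utterance, so the stored pattern and the received pattern are degraded alike.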

Thus, each time communication is established between a
user and the speech recognition system implemented
within computer 16 and a speech utterance is detected,
that utterance is preferably suitably converted for
processing utilizing an analog-to-digital converter 42
and coupled to processor 40. Processor 40 then utilizes
caller identification signals which are available from
the public switched telephone network in conjunction
with caller identification circuitry 44 to establish the
identity of a particular caller by identifying the
telephone instrument utilized. As those skilled in the
art will appreciate, that identity will yield useful
information with regard to the geographic location of a
particular caller and the communication channel
parameters which are typically associated with that
particular caller, based upon the typical communication
path encountered between that caller and the host
location.

Thus, the output of caller identification circuitry 44
is utilized by processor 40 to permit processor 40 to
select a particular one of the multiple caller-specific
libraries 50 contained within memory 46. The input frame
of speech data is then compared to a library of speech
templates within memory 46 to determine the content of
the speech utterance. Processor 40 may then generate an
output signal which may be utilized to control access to
other data, implement a particular activity or otherwise
provide for the verbal control of a peripheral system.

Upon reference to the foregoing, those skilled in the
art will appreciate that core library 48 may be utilized
to provide a standardized series of templates for
utilization in those situations in which the caller
identification may not be determined or, alternatively,
core library 48 may comprise a series of basic
vocabulary templates which are combined with a
caller-specific library to reflect particular
geographic-specific vocabulary items or those spoken
utterances which are greatly affected by transmission
parameters within the communication channel. In either
event, processor 40 processes an input spoken utterance
against a library within memory 46, utilizing caller
identification circuitry 44 to select a caller-specific
library, thereby greatly enhancing the efficiency and
accuracy of the speech recognition system implemented
within computer 16.
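
The two roles of the core library described above (a fallback when the caller cannot be identified, or a base vocabulary combined with a caller-specific library) may be sketched as follows. The function name, the `combine_with_core` flag, the telephone number and the choice to fall back to the core library when an identified caller has no stored library are all assumptions made for the illustration.

```python
from typing import Dict, List, Optional

Template = List[float]
Library = Dict[str, Template]


def select_library(
    caller_id: Optional[str],
    core: Library,
    caller_libraries: Dict[str, Library],
    combine_with_core: bool = True,
) -> Library:
    """Choose the template library against which an utterance is processed."""
    specific = caller_libraries.get(caller_id) if caller_id is not None else None
    if specific is None:
        # Caller unidentified (or no library stored for this caller): core only.
        return core
    if combine_with_core:
        merged = dict(core)
        merged.update(specific)  # caller-specific templates take precedence
        return merged
    return specific


# Hypothetical contents.
core = {"yes": [1.0]}
caller_libraries = {"555-0100": {"Austin": [2.0]}}
active = select_library("555-0100", core, caller_libraries)
```

Keeping the merge separate from the stored libraries leaves the core templates unmodified, so the same core library can serve every caller.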

As discussed above with respect to previously known
attempts at speech recognition, the templates against
which input speech is processed may comprise templates
representing individual words, phrases or portions of
words. As utilized herein, the term "template" shall
mean any stored digital representation which may be
utilized by processor 40 to identify an unknown speech
utterance.

Finally, with reference to Figure 3, there is depict- 
ed a high-level logic flowchart which illustrates a 
speech recognition process. As illustrated, this proc- 
ess begins at block 60 and thereafter passes to block 
62. Block 62 illustrates the establishment of a con- 
nection by the user to the host location via a public 
switched telephone network (see Figure 1). Next, the 
process passes to block 64. Block 64 illustrates a de- 
termination of whether or not a verbal utterance has 
been detected. If not, the process merely iterates until 
such time as an utterance has been detected. How- 
ever, once a verbal utterance has been detected, the 
process passes to block 66.

Block 66 illustrates a determination of whether or 
not the caller identification may be determined from 
the telephonic network. Those skilled in this art will 
appreciate that caller identification is not universally 
applicable and thus, the identification of a particular 
caller accessing the system may not be determined. 
However, in the event the caller identification has 
been determined, the process passes to block 68. 
Block 68 illustrates a selection of a particular caller- 
specific library from within memory 46 (see Figure 
2). As described above, a particular caller-specific li- 
brary preferably contains speech utterances which 
have been processed to accurately reflect the trans- 
mission parameters which affect verbal communica- 
tion within the communication channel. Thus, band- 
width limitations, processing techniques and other 
parameters which affect verbal communication have 
been utilized to create a speech template which more 
accurately reflects utterances which have been processed
through that channel. Additionally, specific vocabulary
words and pronunciations unique to a particular
geographic area associated with the identification of
that caller are also included within the caller-specific
library. For example, a system which permits the verbal
accessing of airline flight schedules will preferably
have a series of caller-specific templates which are
designed to include utterance representations of
geographic locations within the vicinity of the caller's
location, as determined utilizing the telephonic
network, as more likely choices for recognition than
geographic locations which are disposed a great distance
from that location.
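
One way in which nearby locations might be treated as more likely choices is sketched below. The specification does not say how such a preference is computed; the additive distance penalty, the parameter `penalty_per_mile` and the example figures are purely assumptions made for the illustration.

```python
def rank_candidates(acoustic_distance, miles_from_caller, penalty_per_mile=0.001):
    """Order candidate place names, best first.

    The score is the acoustic mismatch plus a penalty that grows with the
    distance between the place and the caller's location (an assumed rule).
    """
    scored = {
        place: acoustic_distance[place] + penalty_per_mile * miles_from_caller[place]
        for place in acoustic_distance
    }
    return sorted(scored, key=scored.get)


# Hypothetical figures: two similar-sounding city names, one close to the caller.
acoustic = {"Austin": 0.52, "Boston": 0.50}
miles = {"Austin": 20.0, "Boston": 1900.0}
ranking = rank_candidates(acoustic, miles)
```

Here the acoustically closer but distant city is outranked by the nearby one, which is the behaviour the airline example calls for.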

After selecting a particular caller-specific library
based upon the caller identification determination, the
process passes to block 70. Block 70 illustrates the
processing of the utterance against that caller-specific
library. Of course, as described above, the processing
of an input speech utterance against a caller-specific
library of templates may also include the processing of
that utterance against a core library of common
utterances which may be easily recognized, despite any
degradation which occurs as a result of transmission
through a communication channel within the public
switched telephone network.

Referring again to block 66, in the event the caller
identification cannot be determined, the process passes
to block 72. Block 72 illustrates the processing of the
input speech utterance against a core library which may
be utilized for those circumstances in which a caller
identification cannot be determined. Thereafter, after
processing the input speech utterance against a core
library or a caller-specific library alone or in
conjunction with some basic core library, the process
passes to block 74. Block 74 illustrates a determination
of whether or not the utterance has been recognized. If
the utterance is not recognized, that is, there exists
no high probability match between that utterance and a
known template, the process passes to block 76. Block 76
illustrates the generation of a suitable error message
and the process then passes to block 78 and returns.
Those skilled in the art will appreciate that
communication may be terminated at this point, or the
speaker may be urged to attempt once again to pronounce
the utterance in a manner which may eventually lead to
recognition of that utterance.

Referring again to block 74, in the event the utterance
has been recognized, the process passes to block 80.
Block 80 then illustrates the processing of that
utterance. Those skilled in the art will appreciate that
"processing an utterance" means the utilization of the
linguistic or intelligence content of that utterance to
access other data, perform some function, or in some way
interact with a peripheral system of computer 16 to
provide an intelligent response to the spoken utterance
or transcribe it. Thereafter, the process passes to
block 78 and returns.
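
The flow of Figure 3 from block 66 onward may be sketched as follows. This is a minimal sketch, not the described implementation: the frame distance, the threshold test standing in for a "high probability match", the error text and the example libraries are assumptions, and each library is assumed to be non-empty.

```python
def frame_distance(a, b):
    """Assumed mismatch measure: summed frame differences plus a length penalty."""
    return sum(abs(x - y) for x, y in zip(a, b)) + abs(len(a) - len(b))


def recognize_call(utterance, caller_id, core, caller_libraries, threshold):
    """Return (word, None) on recognition or (None, error_message) otherwise."""
    # Blocks 66 and 68: use a caller-specific library when the caller is known.
    library = caller_libraries.get(caller_id) if caller_id is not None else None
    if library is None:
        library = core  # block 72: fall back to the core library
    # Blocks 70 and 72: process the utterance against the chosen library.
    best = min(library, key=lambda word: frame_distance(utterance, library[word]))
    # Block 74: was the utterance recognized?
    if frame_distance(utterance, library[best]) > threshold:
        return None, "utterance not recognized; please repeat"  # block 76
    return best, None  # block 80: hand the recognized word on for processing


# Hypothetical libraries and input.
core = {"yes": [1.0, 2.0], "no": [5.0, 5.0]}
caller_libraries = {"555-0100": {"Austin": [3.0, 3.0]}}
known = recognize_call([3.0, 3.1], "555-0100", core, caller_libraries, threshold=0.5)
unknown = recognize_call([3.0, 3.1], None, core, caller_libraries, threshold=0.5)
```

With the caller identified, the place name is found in the caller-specific library; without identification the same utterance matches nothing in the core library closely enough and the error path of block 76 is taken.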

Upon reference to the foregoing, those skilled in the
art will appreciate that by combining a speech
recognition system having a plurality of caller-specific
libraries of speech templates with an existing caller
identification system, a speech recognition system may
be provided which is greatly enhanced in efficiency and
accuracy as the spoken input utterances may be more
accurately recognized by virtue of processing which
takes into account signal variation which occurs as a
result of the communication channel within a telephonic
network and also processing which takes into account
variation in geographic-specific vocabulary and
pronunciation and other linguistic phenomena.

While the invention has been particularly shown 
and described with reference to a preferred embodi- 
ment, it will be understood by those skilled in the art 
that various changes in form and detail may be made 
therein without departing from the scope of the inven- 
tion. 



Claims 

1. A speech recognition system in which input 
frames of speech data are processed against 
stored templates representing speech via a tele- 
phonic network, said system characterised by: 

means for storing a core library of speech 
templates; 

means for storing a plurality of caller-spe- 
cific libraries of speech templates; 

means for determining an identification of 
a caller within said telephone network; 

means arranged to process an input 
speech utterance against said core library of 
speech templates in the event an identification of 
said caller within said telephone network cannot 
be determined; and 

means arranged to process an input 
speech utterance against a selected one of said 
plurality of caller-specific libraries of speech tem- 
plates in response to a determination of an iden- 
tification of said caller within said telephonic net- 
work. 

2. A speech recognition system as claimed in claim 1,
wherein said means for determining an identification of
a caller within said telephonic network comprises means
for utilizing a caller identification system provided
within said telephonic network to determine an
identification of said caller.

3. A speech recognition system as claimed in claim 1 or
claim 2, wherein the plurality of caller-specific
libraries are processed to reflect variations in a
speech utterance which occur as a result of transmission
within said telephonic network.

4. A speech recognition system as claimed in any
preceding claim, wherein said plurality of
caller-specific libraries include vocabulary
pronunciations reflective of specific geographic
locations.

5. A method for speech recognition in which input
frames of speech data are processed against stored
templates representing speech via a telephonic network,
said method characterised by the steps of:

    attempting to determine an identification of a
caller within said telephone network;

    processing an input speech utterance against a core
library of speech templates in the event an
identification of said caller within said telephone
network is not determined; and

    processing an input speech utterance against a
selected one of a plurality of caller-specific libraries
of speech templates in response to a determination of an
identification of said caller within said telephonic
network.

6. A method as claimed in claim 5, wherein said step of
determining an identification of a caller within said
telephonic network comprises the step of utilizing a
caller identification system provided within said
telephonic network to determine an identification of
said caller.



7. A method as claimed in claim 5 or claim 6, comprising
the step of creating and storing a plurality of
caller-specific libraries of speech templates which are
processed to reflect variations in a speech utterance
which occur as a result of transmission within said
telephonic network.



8. A method as claimed in any of claims 5 to 7,
comprising the step of creating and storing a plurality
of caller-specific libraries of speech templates which
include a vocabulary and pronunciation reflective of
specific geographic locations.